Abstract - A major source of signal degradation in real environments is room reverberation.
Monaural speech segregation in reverberant environments is a particularly challenging problem.
Although inverse filtering has been proposed to partially restore the harmonicity of
reverberant speech before segregation, this approach is sensitive to specific source/receiver
and room configurations. This study proposes a supervised learning approach to monaural
segregation of reverberant voiced speech, which learns to map from a set of pitch-based auditory
features to a grouping cue encoding the posterior probability of a time-frequency (T-F)
unit being target dominant given observed features. We devise a novel objective function for
the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The
models trained using this objective function yield significantly better T-F unit labeling. A
segmentation and grouping framework is utilized to form reliable segments under reverberant
conditions and organize them into streams. Systematic evaluations show that our approach
produces very promising results.

Index Terms - Computational auditory scene analysis, monaural segregation, room reverberation,
speech separation, supervised learning.

1 Introduction

Room reverberation is present in everyday listening and poses a considerable challenge to
speech separation. While humans excel at “hearing out” a target source from sound mixtures
in noisy and reverberant conditions, simulating this perceptual ability remains a fundamental
challenge [53]. This study is concerned with monaural separation of reverberant voiced speech,
in which only single-microphone recordings are available. Monaural speech separation has many
applications, including hearing aid design and noise removal for automatic speech recognition.

Various methods have been proposed for monaural speech enhancement or separation, including
spectral subtraction [8], Wiener filtering [13,33], minimum mean square error (MMSE)
estimation [18], and subspace analysis [19]. However, strong assumptions about the interference
(e.g., quasi-stationarity) in these methods limit their application in dealing with a general
acoustic background. A class of speech separation algorithms models the underlying sources
and fits the learned models to the observations. The essence of such algorithms is that the
expected patterns of the sources are extracted through training, and then those patterns whose
combinations best match the observed signal are selected to estimate the sources. In [46],
speaker-dependent hidden Markov models (HMMs) are trained and combined into a factorial
HMM architecture for computing a masking function for separation. A modeling technique
based on composite source modeling is proposed in [41] to model each source using a set of
Gaussian subsources. A soft mask filter is then derived using MMSE estimation for separating
the sources. These approaches can offer satisfactory solutions if the extracted source
characteristics match the statistical properties of the mixed signals. This is, however, not always
the case, which has led to adaptation schemes for adjusting source models [55]. Rather than
modeling each individual source, the relationships between sources can also be learned using a
discriminant method. One example is the spectral learning approach, which is based on
parameterized affinity matrices built from low-level features and solves the separation problem by
formulating it as segmentation in a time-frequency (T-F) plane [4]. The performance of these
methods is unclear in reverberant conditions.

Inspired by human auditory perception [10], computational auditory scene analysis (CASA)
aims to separate a mixture of sources into different auditory streams based on perceptual
principles [53]. CASA systems have significantly advanced the state-of-the-art performance
in monaural separation [11,24,52,54]. An ideal binary T-F mask has been proposed as a
computational goal of CASA [51]. Such a mask can be constructed from prior knowledge
of target and interference; specifically, a value of 1 in the mask indicates that the target is
stronger than interference and 0 otherwise. Studies show that speech reconstructed from the
ideal binary mask produces large improvements in human speech intelligibility [3,12,32]. Such
a goal has been shown to still be reasonable when room reverberation is present [39,44].

Segmentation  and  grouping  are  two  main  stages  of  CASA  [53].  In  segmentation,  the  input 
is  decomposed  into  T-F  segments,  each  of  which  is  a  contiguous  region  and  deemed  to  mainly 
originate from a single source; in grouping, those segments that likely come from the same
source  are  grouped  into  a  stream  by  using  pitch  and  other  grouping  cues.  Pitch,  or  harmonic 
structure,  has  long  been  studied  as  a  prominent  characteristic  of  speech  signals  and  offers 
a  major  cue  for  a  listener  to  separate  target  speech  from  other  sounds  [10].  The  pitch  cue 
has  been  applied  successfully  in  monaural  CASA  algorithms  under  anechoic  conditions  (e.g., 
in  [24]).  However,  the  harmonic  structure  is  distorted  by  reverberation  as  reflections  of  each 
harmonic  combine  with  the  direct  sound.  Due  to  the  weakened  harmonicity  of  the  speech 
signal,  the  performance  of  monaural  CASA  systems  suffers  in  room  reverberation. 

To restore speech harmonicity, a commonly used method is to estimate and apply an inverse
filter of the room impulse response (RIR) corresponding to the target and microphone
locations [21,45,56]. Although inverse filtering can partially counteract the smearing effect
of reverberation on the speech spectrum, it assumes that the room configuration (e.g., room
dimensions, wall reflection coefficients, and source and microphone locations) is stationary.
Any change in room configuration, such as a source movement, degrades the performance
significantly [9,42]. To quantify such effects, we systematically evaluate the sensitivity of inverse
filtering to a number of room configurations and different reverberation times ($T_{60}$). As another
drawback of inverse filtering, the filter estimation requires the absence of interference [45],
which is unrealistic for real-world applications.

In this paper, we propose a supervised learning approach to achieve robustness against
reverberation effects for voiced speech segregation. We first extract a set of pitch-based features
from a reverberant signal and train a multilayer perceptron (MLP) for each channel of a
gammatone filterbank to estimate a grouping cue within each T-F unit. The grouping cue encodes
the posterior probability of a T-F unit being target dominant given observed features. We
then analyze the goal of maximizing signal-to-noise ratio (SNR) in segregation and formulate
an objective function for MLP training. The objective function takes into account unit-wise
errors in a generalized form of mean squared error (MSE). Since it is a continuous function of
model parameters, an error backpropagation technique can be devised in order to maximize
SNR. A new segmentation method is proposed to achieve reliable estimation of auditory
segments, which helps combine grouping cues. Specifically, we use cross-channel correlation and
temporal continuity for segmentation in the low-frequency range because they are relatively
robust to reverberation. In the high-frequency range, onset/offset detection [26] is applied
to capture intensity variation and form segments by matching pairs of detected onsets and
offsets. Because the direct path from a source is the shortest path, onset detection is relatively
unaffected by reverberation [34]. Finally, the grouping stage organizes segments into streams
by combining grouping cues.

The paper is organized as follows. In the next section, we present an overview of the
proposed system. Section 3 describes how to extract pitch-based features and perform MLP
learning. A detailed description of the segmentation and grouping stage is presented in
Section 4. Section 5 provides experimental results and comparisons. Section 6 analyzes system
robustness quantitatively at the feature level. We discuss related issues and conclude the
paper in Section 7.

Figure 1: Schematic diagram of the proposed system. A reverberant mixture is processed in a three-stage
system. The first stage analyzes the input signal by an auditory filterbank in successive time frames and
extracts pitch-based features within each time-frequency unit. In the second stage, an MLP is trained in every
channel to associate those features with the grouping cues. T-F units are then labeled according to a criterion
based on the MLP output. The last stage performs segmentation and grouping. A target stream together
with its background is formed.


2  System  Overview 

As  shown  in  Fig.  1,  the  proposed  system  consists  of  three  stages.  The  first  stage  analyzes 
the  input  signal  in  the  time-frequency  domain  using  an  auditory  periphery  model.  A  T-F 
unit  corresponds  to  a  certain  channel  in  the  filterbank  at  a  certain  time  frame.  Normalized 
correlograms  are  then  computed.  In  order  to  detect  both  resolved  and  unresolved  harmonics, 
auditory  features  are  extracted  based  on  both  the  filter  response  and  the  response  envelope 
within  each  of  the  units.  Section  3.1  gives  the  detail  of  this  stage. 

The next stage is to label each of the T-F units using an MLP. Previous studies [24,45] treat
a  single  pitch-based  feature  as  the  grouping  cue  and  rely  on  thresholding  for  unit  labeling. 
Under  reverberant  conditions,  such  pitch-based  features  are  no  longer  reliable  due  to  smeared 
harmonicity.  Therefore,  we  use  multiple  features  to  capture  harmonicity  within  a  unit  and 
label  it  using  a  trained  classifier.  Our  training  objective  is  to  maximize  the  SNR  performance 
instead  of  minimizing  unit  labeling  errors.  This  new  objective  function  makes  the  learning 
process  cost  sensitive  and  yields  good  performance.  Labeling  a  unit  based  on  the  MLP  output 
is  discussed  from  a  probabilistic  perspective.  Essentially,  the  MLP  output  is  translated  into 
the  posterior  probability  of  a  T-F  unit  belonging  to  the  target  and  a  labeling  criterion  is 
consequently derived. This part is described in Sections 3.2 and 3.3.

Segmentation  and  grouping  take  place  in  the  third  stage.  It  integrates  local  unit  labels 
using  more  global  information  about  the  source  and  segregates  the  mixture  into  a  target  and 
a  background  stream.  To  improve  segmentation  in  reverberant  conditions,  we  apply  different 
strategies  in  different  frequency  ranges.  Specifically,  segmentation  in  low  frequency  merges 
T-F  units  using  cross-channel  correlation  and  temporal  continuity.  In  high  frequency,  onset 
and offset detection is utilized. Onset cues are expected to be robust to room reverberation
in light of the precedence effect [34], which refers to the perceptual importance of a direct
sound or signal onset. Once segments are formed, they are grouped into the target or the
interference stream, resulting in a binary mask where all T-F units are labeled 1 for the target
stream and 0 otherwise. This mask gives an estimate of the ideal binary mask and is used to
resynthesize  segregated  target  speech.  Details  are  presented  in  Section  4. 


3  Learning  Grouping  Cues 

Our  goal  is  to  learn  from  a  reverberant  mixture  detectable  cues  that  indicate  whether  target 
speech  dominates  in  each  T-F  unit.  Specifically,  we  learn  a  mapping  from  a  set  of  pitch-based 
features  to  a  grouping  cue,  which  encodes  the  posterior  probability  of  a  T-F  unit  being  target 
dominant. 


3.1  Feature  Extraction 

To extract pitch-based features, an input mixture $x(t)$ is first decomposed into the
time-frequency domain using a gammatone filterbank and time windowing. This filterbank is a
standard model of cochlear filtering and is derived from psychophysical studies of the
auditory periphery [40]. We use a 128-channel filterbank whose center frequencies are
quasi-logarithmically spaced from 50 Hz to 8000 Hz. The response of a filter channel is further
transduced by the Meddis model of auditory nerve transduction [36], which simulates the
nonlinear properties of inner hair cells and produces the firing rate of an auditory nerve fiber,
denoted by $h(c,t)$. Note that $h(c,t)$ retains the original sampling frequency. In each channel,
the output is divided into 20-ms time frames with 10-ms overlap between consecutive
frames. The resulting time-frequency representation is called a cochleagram, with $u_{c,m}$
denoting a T-F unit for frequency channel $c$ and time frame $m$. This is a standard procedure of
peripheral analysis in CASA and implementation details are presented in [53].

Then, the normalized correlogram $A(c,m,\tau)$ for $u_{c,m}$ at time lag $\tau$ is computed by the
following autocorrelation:

$$A(c,m,\tau) = \frac{\sum_{n=-N/2}^{N/2} h(c, mN/2+n)\, h(c, mN/2+n+\tau)}{\sqrt{\sum_{n=-N/2}^{N/2} h^2(c, mN/2+n)}\,\sqrt{\sum_{n=-N/2}^{N/2} h^2(c, mN/2+n+\tau)}} \qquad (1)$$


where  N  denotes  the  frame  size.  Since  we  use  input  mixtures  sampled  at  16  kHz,  N  =  320. 
The  range  of  the  normalized  correlogram  is  [0, 1]  with  value  1  at  zero  time  lag. 
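
As an illustration of Eq. (1), the following is a minimal Python sketch for a single T-F unit. The function name `normalized_correlogram` and the zero-padding at the signal edges are assumptions made for this example; they are not taken from the described system.

```python
import math

def normalized_correlogram(h, m, lag, frame_size=320):
    """Normalized autocorrelation A(c, m, lag) for one channel response h.

    h is the hair-cell output of a single channel (list of floats); the frame
    is centred at sample m * frame_size / 2, following Eq. (1).
    """
    half = frame_size // 2
    center = m * half

    def sample(k):
        # Assumption: samples outside the recorded signal are treated as zero.
        return h[k] if 0 <= k < len(h) else 0.0

    num = e0 = e1 = 0.0
    for n in range(-half, half + 1):
        a = sample(center + n)
        b = sample(center + n + lag)
        num += a * b   # cross term of the numerator
        e0 += a * a    # energy of the undelayed frame
        e1 += b * b    # energy of the delayed frame
    denom = math.sqrt(e0) * math.sqrt(e1)
    return num / denom if denom > 0.0 else 0.0
```

For a non-negative response that is exactly periodic, the value at a lag equal to the period is 1, which is why the correlogram value at the pitch lag can serve as a harmonicity measure.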
Following [23], we construct a feature vector for each $u_{c,m}$ as:

$$\mathbf{x}_{c,m} = \begin{bmatrix} A(c,m,\tau_m) \\ [f(c,m)\cdot\tau_m] \\ \left| f(c,m)\cdot\tau_m - [f(c,m)\cdot\tau_m] \right| \\ A_E(c,m,\tau_m) \\ [f_E(c,m)\cdot\tau_m] \\ \left| f_E(c,m)\cdot\tau_m - [f_E(c,m)\cdot\tau_m] \right| \end{bmatrix} \qquad (2)$$

where $\tau_m$ is the pitch period at frame $m$. The average instantaneous frequency $f(c,m)$ is
estimated from the zero crossing rate of $A(c,m,\tau)$. Multiplying $f(c,m)$ with $\tau_m$ provides
an alternative way of periodicity comparison [14] and supplements the autocorrelation based
measure in the feature vector. So, the second and third features indicate how close the period
within $u_{c,m}$ is to the pitch period - the nearest integer $[\cdot]$ to the product indicates a harmonic
number and the distance $|\cdot|$ between the product and the above integer measures the deviation
between the two periods. Similarly, we calculate the envelope of the hair cell output $h_E(c,t)$
and the corresponding normalized correlogram $A_E(c,m,\tau)$ to derive the next three features in
(2). Here, the purpose is to extract amplitude modulation (AM) for high-frequency channels
because these channels respond to unresolved harmonics and response envelopes better reveal
the periodicities of these harmonics. To extract AM, we perform band-pass filtering with the
passband from 50 to 550 Hz, which corresponds to the plausible pitch range of the target
speech.
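
To make the construction of the six features concrete, here is a small Python sketch. It assumes the correlogram values at the pitch lag and the average instantaneous frequencies have already been computed; the function name `pitch_features` and the use of Hz and seconds as units are assumptions for illustration.

```python
import math

def pitch_features(a_tau, freq, a_env_tau, freq_env, pitch_period):
    """Six-element feature vector of Eq. (2) for one T-F unit.

    a_tau, a_env_tau : correlogram values at the pitch lag (response, envelope)
    freq, freq_env   : average instantaneous frequencies in Hz (assumed unit)
    pitch_period     : tau_m in seconds (assumed unit)
    """
    def harmonic_terms(f):
        product = f * pitch_period
        nearest = math.floor(product + 0.5)  # nearest integer [.]
        # Harmonic number and deviation of the product from that integer.
        return float(nearest), abs(product - nearest)

    harmonic, deviation = harmonic_terms(freq)
    harmonic_env, deviation_env = harmonic_terms(freq_env)
    return [a_tau, harmonic, deviation, a_env_tau, harmonic_env, deviation_env]
```

For example, with a 100-Hz pitch (period 0.01 s), a unit whose response frequency is 305 Hz maps to harmonic number 3 with a deviation of about 0.05.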

The desired value of the grouping cue $C_g(c,m)$ is defined to be 1 if $u_{c,m}$ is dominated by the
target  stream  and  0  otherwise,  consistent  with  the  notion  of  the  ideal  binary  mask  [51]  which 
labels  a  T-F  unit  as  target  if  and  only  if  target  energy  is  greater  than  interference  energy 
within  that  unit.  Thus,  the  ideal  binary  mask  provides  the  desired  values  of  Cg(c,  m). 
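
The training targets can be sketched as follows, assuming the per-unit energies of the premixed target and interference are available as two equally sized 2-D lists indexed by channel and frame. The function name and data layout are hypothetical.

```python
def ideal_binary_mask(target_energy, interference_energy):
    """Return a mask with 1 where target energy exceeds interference energy.

    Both arguments are lists of rows (one row per channel, one entry per
    frame); this layout is an assumption for the example.
    """
    return [
        [1 if t > i else 0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_energy, interference_energy)
    ]
```

Units with equal target and interference energy are labeled 0 here, matching the strict "greater than" condition in the text.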

3.2  MLP  Training 

We use an MLP to learn the grouping cue $C_g(c,m)$ from the pitch-based features $\mathbf{x}_{c,m}$.
Training usually minimizes an objective function (i.e., error function) defined as the squared distance
between desired and actual outputs. Our previous study [28] uses a conventional MSE
objective function, defined as

$$J_c = \frac{1}{M}\sum_{m} \left(d_c(m) - y_c(m)\right)^2 \qquad (3)$$

where $d_c(m)$ and $y_c(m)$ are the desired (binary) and actual outputs, $m$ is the frame index, $M$ is the
total number of frames, and $c$ is the channel index. One MLP is trained for each channel. The model
using the above objective function performs reasonably well [28]. However, $J_c$ treats all T-F
units equally. Such treatment may not be optimal - a T-F unit with higher energy contributes
more to the overall SNR than a unit with lower energy. In other words, minimizing $J_c$ does
not necessarily lead to the best SNR performance.

In order to derive an objective function that directly relates to the goal of maximizing
SNR, we start by analyzing the SNR definition. Since the computational goal of our proposed
segregation system is to identify T-F regions that are target dominant, we use the same SNR
measure as in [24], which regards the resynthesized signal from the ideal binary mask as ground
truth:

$$\mathrm{SNR} = 10\log_{10}\frac{\sum_t s_I^2(t)}{\sum_t \left(s_I(t) - s_E(t)\right)^2} \qquad (4)$$

Here $s_I(t)$ and $s_E(t)$ are signals resynthesized from the ideal binary mask and an estimated
mask, respectively. Consider the SNR in a single channel, as training is conducted independently
within individual channels. To maximize the overall SNR, we maximize the SNR in each
channel. Rewriting (4) for a single channel gives


$$\mathrm{SNR}_c = 10\log_{10}\frac{\sum_m d_c(m)\cdot E_c(m)}{\sum_m \left|d_c(m) - Y_c(m)\right|\cdot E_c(m)} \qquad (5)$$

where $E_c(m)$ represents the mixture energy within $u_{c,m}$, calculated as the sum of squares of
the unit response, and $Y_c(m)$ is the actual binary label, binarized from $y_c(m)$. From (5), it is
intuitively clear that minimizing the denominator maximizes $\mathrm{SNR}_c$. Therefore, we define the
new objective function $J'_c$ as

$$J'_c = \frac{\sum_m \left(d_c(m) - y_c(m)\right)^2\cdot E_c(m)}{\sum_m E_c(m)}. \qquad (6)$$


Note that the function J'c is modified from the denominator in (5) in order to make it differentiable,
which is needed for applying gradient descent learning. The denominator in (6) is added for
the  purpose  of  normalization  (cf.  (3)).  It  is  worth  mentioning  that  J'c  is  a  generalized  form  of 
MSE,  with  each  squared  error  weighted  by  normalized  energy  within  the  corresponding  T-F 
unit. 
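
The energy-weighted objective of Eq. (6) and its derivative with respect to each network output can be sketched as below. This is an illustrative example only; the function names are hypothetical, and the gradient shown is the plain derivative of (6), not the full training procedure.

```python
def weighted_mse(desired, actual, energy):
    """Energy-weighted squared error of Eq. (6) for one channel."""
    total = sum(energy)
    return sum((d - y) ** 2 * e for d, y, e in zip(desired, actual, energy)) / total

def weighted_mse_grad(desired, actual, energy):
    """Derivative of Eq. (6) with respect to each actual output y_c(m).

    Each unit's error term is scaled by its normalized energy, so the
    weight acts as a constant factor in the backpropagated error.
    """
    total = sum(energy)
    return [-2.0 * (d - y) * e / total for d, y, e in zip(desired, actual, energy)]
```

With equal energies the function reduces to the conventional MSE of Eq. (3); with unequal energies, an error in a high-energy unit costs more than the same error in a low-energy unit.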

From the machine learning point of view, the inclusion of weights in a classification task is
known as cost-sensitive learning. It is the appropriate form of learning when different misclassification
errors should incur different penalties, which is exactly the situation we are facing. A number of
studies have discussed cost-sensitive neural networks [29,58]. Ideas include over/under-sampling,
network output adapting, learning rate adapting, and minimization of misclassification costs.
The first idea is to change the training data distribution such that the costs of the examples are
conveyed by the frequency of the examples. The over-sampling method increases the training time
and may lead to overfitting since it makes multiple copies of the examples with higher costs,
while the performance of the under-sampling method may also be degraded because it discards
potentially useful training examples. The second idea scales the network output to give the
classes with higher expected misclassification costs higher impact on the learning. It does not
fit our case because its costs are tied to classes rather than to individual examples.
The third and the fourth ideas are similar to giving high-cost examples higher weights. The
difference lies in the way weights are incorporated in training - one method increases the learning
rate for high-cost examples and the other changes the objective function into a weighted form.
We choose to use the fourth method because it tackles the problem in a more explicit way and
has the exact form derived in (6). The backpropagation algorithm is adapted in this case
to learn MLP parameters. In theory, each of the weights in (6) acts as a constant factor in
the partial derivatives of J'c, so the delta rule can be easily rewritten. It should also be noted
that the normalization term in (6) is necessary to ensure the convergence of the modified
backpropagation algorithm [29].

Figure 2: Applying MLP in unit labeling, where a binary decision, target or interference dominant, is made
in each T-F unit. In the above cochleagram, the units labeled as target dominant are indicated by white and
those labeled otherwise are indicated by black.

In  implementation,  we  train  one  MLP  for  each  channel.  Each  MLP  has  the  same  network 
topology  with  6  input  nodes,  20  hidden  nodes  and  1  output  node.  The  number  of  hidden 
nodes is chosen based on ten-fold cross-validation. The transfer functions of the hidden and
output layers are both hyperbolic tangent sigmoids. During training, we use J'c in conjunction
with a generalized Levenberg-Marquardt backpropagation algorithm [22], which achieves fast
convergence by avoiding the computation of the Hessian matrix.

3.3  MLP  Based  Unit  Labeling 

For each T-F unit $u_{c,m}$, we apply the trained MLP to the feature vector $\mathbf{x}_{c,m}$, yielding $C_g(c,m)$. It
should be noted that each channel has a separately trained MLP, as illustrated in Fig. 2. We
then use this grouping cue to label $u_{c,m}$. Formally speaking, the trained MLP estimates the
posterior probability directly [6,38]; therefore, the grouping cue can be described as:

$$C_g(c,m) = P(H_1 \mid \mathbf{x}_{c,m}) \qquad (7)$$

where $H_1$ is the hypothesis of $u_{c,m}$ being target dominant. Let $H_0$ be the hypothesis of
$u_{c,m}$ being interference dominant. Consequently, we define the unit labeling criterion: a T-F
unit $u_{c,m}$ is labeled as target speech if $P(H_1 \mid \mathbf{x}_{c,m}) > P(H_0 \mid \mathbf{x}_{c,m})$. Because
$P(H_1 \mid \mathbf{x}_{c,m})$ and $P(H_0 \mid \mathbf{x}_{c,m})$ sum to one, the above inequality can be written as
$P(H_1 \mid \mathbf{x}_{c,m}) > 1 - P(H_1 \mid \mathbf{x}_{c,m})$. Hence, this criterion can be simplified as

$$C_g(c,m) > 1/2. \qquad (8)$$

Note that the above criterion is based on the assumption that the priors $P(H_0)$ and $P(H_1)$
remain unchanged during training and labeling phases. When we interpret the MLP output as
an estimate of the posterior probability, this estimate encapsulates prior information according
to the Bayes rule. In other words, the decision rule is optimal only if there is no mismatch
between training and test priors. When this condition is violated, the decision rule becomes
sub-optimal and possibly unacceptable [1]. Although this is not a concern in our paper, we
discuss in Section 7 circumstances in which such a mismatch may occur and possible solutions
to compensate the classifier for more reliable performance.
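
The labeling rule of (8) amounts to a threshold on the estimated posterior. A minimal sketch follows, assuming the grouping cues have already been mapped to the range [0, 1]; the function names and the 2-D list layout are hypothetical.

```python
def label_unit(posterior_target):
    """Label a unit 1 if P(H1|x) > P(H0|x), i.e. if the cue exceeds 1/2."""
    return 1 if posterior_target > 1.0 - posterior_target else 0

def label_units(cues):
    """Apply the criterion to a 2-D list of cues (rows: channels, columns: frames)."""
    return [[label_unit(cg) for cg in row] for row in cues]
```

A cue of exactly 1/2 is labeled 0 here because the criterion uses a strict inequality.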

4  Segmentation  and  Grouping 

The  segmentation  and  grouping  stage  segregates  a  reverberant  mixture  into  a  target  and  an 
interference stream. T-F unit labeling gives one way of segregation; however, it is error-prone
because a local unit is too small for robust decisions in the presence of interference and room
reverberation. This is supported by a comparison between the segregation results at unit and
segment levels [24]. To utilize more global information of the source that is missing from
individual  units,  we  adopt  the  stage  of  segmentation  in  CASA  [53]  and  form  segments  on 
the  T-F  plane  based  on  auditory  cues.  A  segment  is  a  contiguous  region  of  T-F  units  and 
segment-level  information  is  expected  to  provide  a  more  robust  foundation  for  grouping. 

To  segment  reverberant  mixtures,  we  apply  two  different  strategies  in  different  frequency 
ranges.  Specifically,  in  low  frequency  (below  800  Hz)  we  merge  T-F  units  into  segments 
based on cross-channel correlation and temporal continuity [52]. The first cue arises from
the  fact  that  a  single  harmonic  or  formant  activates  a  number  of  adjacent  channels  due  to 
their  overlapping  bandwidths  and  their  responses  are  highly  correlated.  In  addition,  a  signal 
usually  lasts  for  some  time,  which  implies  temporal  continuity.  The  cross-channel  correlation 
is  calculated  as 

$$C(c,m) = \frac{1}{L}\sum_{\tau=0}^{L-1} \hat{A}(c,m,\tau)\,\hat{A}(c+1,m,\tau) \qquad (9)$$

where $\hat{A}(c,m,\tau)$ is a normalized autocorrelation function with zero mean and unit variance
and $L$ is the maximum delay for the plausible pitch range. Only units with sufficiently high
cross-channel correlation - greater than 0.99 [25] - are selected and iteratively merged into
segments.

Figure 3: Example of segmentation. (a) Segmentation using cross-channel correlation; (b) segmentation using
cross-channel correlation in the low-frequency range and onset/offset analysis in the high-frequency range. White
indicates background. Regions of different gray levels indicate different T-F segments.
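
The cross-channel correlation of Eq. (9) can be sketched in Python as follows. The sketch assumes that the autocorrelation of each channel is given as a list over lags 0 to L-1 and that zero-mean, unit-variance normalization is done with the population standard deviation; both the function names and that normalization detail are assumptions.

```python
import math

def standardize(values):
    """Scale a sequence to zero mean and unit variance (population variance)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0] * n  # a constant sequence carries no correlation information
    return [(v - mean) / std for v in values]

def cross_channel_correlation(acf_c, acf_next):
    """Eq. (9): mean product of standardized autocorrelations of adjacent channels."""
    a = standardize(acf_c)
    b = standardize(acf_next)
    return sum(x * y for x, y in zip(a, b)) / len(a)

def is_mergeable(acf_c, acf_next, threshold=0.99):
    """Select a unit for segmentation when the correlation exceeds the threshold."""
    return cross_channel_correlation(acf_c, acf_next) > threshold
```

Two adjacent channels responding to the same harmonic have nearly identical autocorrelation shapes, so their correlation is close to 1 and the unit passes the 0.99 threshold.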

It  is  widely  known  that  amplitude  modulation  effects  of  unresolved  harmonics  occur  in 
high-frequency channels. In segmentation, the cross-channel correlation $C_E(c,m)$, which is
calculated from the response envelopes $h_E(c,t)$, has proven to be useful [25]. However,
$C_E(c,m)$ is sensitive to room reverberation, which hinders its application (see Section 5).
Signal  onsets,  on  the  other  hand,  are  relatively  unaffected  by  room  reverberation  and  offsets 
could  be  determined  between  detected  onsets.  Therefore,  we  propose  that  high-frequency 
regions  be  segmented  using  onset  and  offset  detection  [26].  This  method  first  smooths  signal 
intensity  over  time  to  reduce  insignificant  fluctuations  and  then  over  frequency  to  enhance 
synchronized  onsets  and  offsets.  It  then  detects  onsets  and  offsets  from  smoothed  intensity  in 
each  channel.  Segments  are  formed  by  matching  pairs  of  onset  and  offset  fronts,  which  are  the 
vertical  contours  connecting  onset  and  offset  candidates  across  frequency.  In  order  to  achieve 
a  compromise  between  over-  and  under-segmentation,  a  multiscale  integration  is  applied  from 
a  coarse  scale  to  a  fine  scale.  Along  the  scale  change,  new  segments  are  created  and  existing 
segments  are  better  localized.  Fig.  3  compares  segmentation  with  and  without  using  onset  and 
offset  cues.  The  input  signal  is  a  reverberant  mixture  of  a  voiced  utterance  and  a  pure  tone 
in a room whose $T_{60}$ = 0.3 s. As can be seen in Fig. 3(b), more significant segments in high
frequency  are  detected,  indicating  more  effective  segmentation  using  onset/offset  analysis. 

With  unit  labels  obtained  in  Section  3.3  together  with  T-F  segments,  we  group  each  segment 
into the target stream if the energy corresponding to its T-F units with target labels (1s)
dominates, i.e., is greater than the energy of its T-F units with non-target labels (0s). Finally,
to group more units into the target stream, we expand each segment in the target stream by
iteratively  recruiting  its  neighboring  units  that  are  labeled  as  target  and  do  not  belong  to  any 
segment  [25].  Consequently,  a  binary  mask  is  formed  and  the  segregated  target  speech  can  be 
resynthesized  from  this  mask  for  performance  evaluation  [53]. 
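
The grouping and expansion steps described above can be sketched as follows. The data layout (dictionaries keyed by `(channel, frame)` tuples) and the use of four-connected neighbors during expansion are assumptions made for this example; the text does not specify the neighborhood.

```python
def group_segments(segments, labels, energy):
    """Return the set of T-F units (channel, frame) assigned to the target stream.

    segments : list of segments, each a list of (channel, frame) tuples
    labels   : dict mapping (channel, frame) -> 1 (target) or 0 (non-target)
    energy   : dict mapping (channel, frame) -> unit energy
    """
    segmented = {u for seg in segments for u in seg}
    target = set()
    for seg in segments:
        e_target = sum(energy[u] for u in seg if labels[u] == 1)
        e_other = sum(energy[u] for u in seg if labels[u] == 0)
        if e_target > e_other:
            target.update(seg)  # the whole segment joins the target stream

    # Expansion: iteratively recruit neighbouring units that are labeled as
    # target and do not belong to any segment (4-connectivity is assumed).
    frontier = list(target)
    while frontier:
        c, m = frontier.pop()
        for nb in ((c + 1, m), (c - 1, m), (c, m + 1), (c, m - 1)):
            if labels.get(nb) == 1 and nb not in segmented and nb not in target:
                target.add(nb)
                frontier.append(nb)
    return target
```

The returned set corresponds to the 1-entries of the estimated binary mask; all other units are 0.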

5  Evaluation  and  Comparison 

5.1  Corpus  Generation 

To simulate typical room acoustics, we use the image model, which is commonly applied for
efficient simulation of the acoustic properties of enclosures [2]. The basic idea of the
Allen-Berkley image model is that the room impulse response (RIR) can be represented by an infinite
number of image sources that are created by reflecting an acoustic source in the six room walls.
In such a model, a pair of physical locations, corresponding to the source and the microphone,
determines the RIR in a fixed room. In order to simulate both convolutive and additive distortions,
we specify the locations of the target and one interfering source and one more location for
the microphone. More specifically, we start with anechoic target speech $s(t)$ and anechoic
interference $n(t)$. We then generate a simulated room and randomly create a set, $\{r_T, r_I, r_M\}$,
representing the locations of the target, the interference and the microphone inside the room,
respectively. From these locations, two RIRs are calculated by


$$h_T(t) = f(r_T, r_M) \qquad (10)$$

$$h_I(t) = f(r_I, r_M). \qquad (11)$$

h_T(t) is the RIR corresponding to the recorded target at the microphone, and h_I(t) corresponds
to the recorded interference at the same microphone. Both h_T(t) and h_I(t) are causal finite
impulse response (FIR) filters. f(·) denotes the image model discussed above, which calculates
the RIR with respect to the input location pair. Consequently, a reverberant mixture r(t) is
constructed by

$$r(t) = h_T(t) * s(t) + \alpha \cdot h_I(t) * n(t) \qquad (12)$$

where * denotes convolution. We use α as a coefficient in order to set mixture SNR to 0
dB.
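The construction in (12) can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiments: the function names, the toy signals and the short toy "RIR's" are hypothetical, and it assumes that the 0 dB SNR is measured between the two reverberant components (the text does not say whether the anechoic or the reverberant signals are used).

```python
import math

def convolve(h, x):
    """Full linear convolution of two finite sequences (direct form)."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        if hi == 0.0:
            continue  # skip zero taps; RIRs are often sparse
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

def reverberant_mixture(s, n, h_t, h_i, snr_db=0.0):
    """Build r(t) = h_T * s + alpha * (h_I * n), following (12).

    Assumption: alpha is chosen so that the energy ratio between the
    reverberant target and the scaled reverberant interference is snr_db.
    """
    target = convolve(h_t, s)
    interf = convolve(h_i, n)
    alpha = math.sqrt(energy(target) / (energy(interf) * 10 ** (snr_db / 10)))
    # Zero-pad the shorter component so both can be summed sample by sample.
    length = max(len(target), len(interf))
    target += [0.0] * (length - len(target))
    interf += [0.0] * (length - len(interf))
    mixture = [t + alpha * v for t, v in zip(target, interf)]
    return mixture, alpha, target, interf

# Toy anechoic signals and short toy impulse responses (illustrative values only).
s = [1.0, -0.5, 0.25, 0.0]
n = [0.2, 0.1, -0.3, 0.4]
h_t = [1.0, 0.0, 0.5]
h_i = [0.8, 0.3]

mixture, alpha, target, interf = reverberant_mixture(s, n, h_t, h_i)
# SNR of the resulting mixture, which should be 0 dB up to rounding error.
snr = 10 * math.log10(energy(target) / energy([alpha * v for v in interf]))
```

In practice the two impulse responses would come from the image model and the convolutions would be done with an FFT-based routine; the direct form is used here only to keep the sketch dependency-free.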

In order to systematically evaluate the proposed system under different reverberant conditions,
we simulate six acoustic rooms with different sizes, whose reverberation times (T60)
range from 0.1 to 0.6 s in steps of 0.1 s. Table 1 shows detailed room specifications. In each
room, we randomly create three sets of locations as mentioned above, resulting in three sets
of {h_T(t), h_I(t)} and three sets of reverberant mixtures created by (12). For example, Fig. 4
illustrates a simulated room with T60 = 0.3 s. The room size is 6 x 4 x 3 m (length, width,
height), but the figure only shows the first two dimensions for clarity. The pentagram, the
circle and the asterisk display the locations of {r_T, r_I, r_M} in each set, respectively.

Table 1: The settings of six acoustic rooms (L: length, W: width, H: height)

Room No.    L x W x H (m)    Reflection Coeff.    T60 (s)
1           4 x 4 x 3        0.40                 0.1
2           5 x 4 x 3        0.62                 0.2
3           6 x 4 x 3        0.73                 0.3
4           7 x 5 x 3        0.80                 0.4
5           8 x 5 x 3        0.84                 0.5
6           9 x 5 x 3        0.87                 0.6

Figure 4: Room configurations with three sets of locations {r_T, r_I, r_M} randomly created in an enclosure.
Each line indicates the direct transmission path from the source to the microphone, corresponding to one
RIR. For clarity, the example room is shown in two dimensions, though the simulations are performed in three
dimensions. The solid line represents the RIR from which the inverse filter is estimated.
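As an illustration of how an image model turns a location pair into an RIR, the following Python sketch enumerates mirrored copies of the source in a rectangular room and adds one attenuated, delayed impulse per image. It is a simplified sketch rather than the simulator used in this study: the sampling rate `fs`, the speed of sound `c`, the truncation parameters `max_order` and `length`, and the rounding of each delay to the nearest sample are all assumptions, and a single reflection coefficient is shared by all six walls, as in Table 1.

```python
import math
from itertools import product

def image_model_rir(room, src, mic, beta, fs=16000, c=343.0,
                    max_order=3, length=2048):
    """Simplified image-model RIR for a rectangular room.

    room, src, mic: (x, y, z) in metres; beta: reflection coefficient
    shared by all six walls. Images arriving after `length` samples
    are discarded.
    """
    h = [0.0] * length
    orders = range(-max_order, max_order + 1)
    # Per axis, list the image coordinates and how many wall hits each needs.
    axes = []
    for size, s in zip(room, src):
        images = []
        for n in orders:
            for q in (0, 1):
                coord = (1 - 2 * q) * s + 2 * n * size  # mirrored source position
                bounces = abs(n - q) + abs(n)           # reflections on this axis
                images.append((coord, bounces))
        axes.append(images)
    # Combine the three axes into 3-D image sources.
    for (x, bx), (y, by), (z, bz) in product(*axes):
        d = math.dist((x, y, z), mic)                   # image-to-microphone distance
        k = round(d / c * fs)                           # propagation delay in samples
        if k < length:
            h[k] += beta ** (bx + by + bz) / (4 * math.pi * d)
    return h

# Room 3 of Table 1 with the source/microphone positions quoted for Fig. 5(a).
room = (6.0, 4.0, 3.0)
src = (4.0, 0.9, 1.0)
mic = (2.0, 1.0, 1.0)
h = image_model_rir(room, src, mic, beta=0.73)
```

The first nonzero tap of `h` is the direct path, with amplitude 1/(4πd); later taps are reflections scaled by one factor of the reflection coefficient per wall hit.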

Our evaluation first uses Cooke’s corpus [14], which contains 100 noisy utterances constructed
by mixing 10 anechoic voiced utterances and 10 different types of interference. In the
aforementioned way, we generate a total of 1,900 mixtures, with the original 100 mixtures in
anechoic and 6 x 3 x 100 mixtures in reverberant conditions. We further evaluate the proposed
system using utterances from the TIMIT speech corpus [20]. Four speakers, two males and
two females, are randomly selected from the “DR1” through “DR4” dialect regions, respectively.
For each speaker, we mix 10 anechoic utterances with the same 10 interferences to generate
1,900 mixtures. The simulated rooms and the source and microphone locations are the same as
in the evaluation with Cooke’s corpus.

5.2 Sensitivity of Inverse Filtering

As mentioned in Section 1, one main problem with the inverse filtering approach is its sensitivity
to even small changes in the acoustic environment. In other words, if an inverse filter is
estimated from the same RIR used in segregation, i.e., matched inverse filtering, it is expected
to enhance speech harmonicity; otherwise it may further smear the harmonic structure. Fig. 5
illustrates the effects of applying the same inverse filter to a matched RIR and a mismatched
RIR with the source location moved. As can be seen in Fig. 5(b), the equalized response in the
matched condition is much more impulse-like, indicating the success of reverberation attenuation,
while as shown in Fig. 5(d) the mismatched condition leads to further smearing.

We quantitatively evaluate the sensitivity of inverse filtering in terms of the signal-to-reverberant
energy ratio (SRR). SRR is an indicator of the intelligibility of reverberant speech [27] and hence
provides a measure of the effectiveness of inverse filtering. SRR is defined as:

$$\mathrm{SRR} = 10 \log_{10} \left( \int_0^{t_x} p^2(t)\,dt \Big/ \int_{t_x}^{\infty} p^2(t)\,dt \right) \qquad (13)$$

where p(t) is the instantaneous sound pressure of the RIR measured at time t, and t_x is
the arrival time of the first peak from the reflected impulses. A larger SRR value indicates
better inverse filtering. Table 2 shows the SRR improvement after applying the inverse filter
to RIR’s in the six rooms in Table 1. We have already created for each acoustic room six
different RIR’s from the three sets of {r_T, r_I, r_M} described in Section 5.1. These six RIR’s are
named {“1a”, “1b”, “2a”, “2b”, “3a”, “3b”} as presented in Fig. 4, where “a” refers to the RIR
from {r_T, r_M} and “b” to the RIR from {r_I, r_M} in each set. The inverse filter is estimated to
equalize one RIR (“1a” in Room 3; the solid line in Fig. 4 shows its direct path) using [21]. To examine
the sensitivity of inverse filtering, this estimated inverse filter is convolved with all
6 x 6 = 36 RIR’s and the resulting SRR’s are calculated accordingly. It is evident from
Table 2 that significant SRR improvement only occurs under the matched inverse filtering
condition (shown in italic). The SRR drops for almost all other cases, implying a further
smearing effect caused by mismatched inverse filtering.
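The SRR measurement and the matched/mismatched comparison can be illustrated with a small Python sketch. It is only a toy demonstration under stated assumptions: the RIR's are sampled sequences, the boundary t_x is passed in as a sample index (how it is detected is not specified here), and the two-tap responses and the truncated inverse filter are made-up values, not the responses behind Table 2.

```python
import math

def convolve(h, x):
    """Full linear convolution of two finite sequences (direct form)."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

def srr_db(rir, t_x):
    """Signal-to-reverberant energy ratio of a sampled RIR, in dB.

    t_x is the sample index of the first reflection: samples before it
    count as direct energy, samples from t_x onward as reverberant energy.
    """
    direct = sum(v * v for v in rir[:t_x])
    reverberant = sum(v * v for v in rir[t_x:])
    return 10 * math.log10(direct / reverberant)

# Toy inverse filter: the truncated series inverse of the response [1, 0.5].
inverse = [1.0, -0.5, 0.25, -0.125]
matched = [1.0, 0.5]       # the response the filter was designed for
mismatched = [1.0, -0.5]   # a different response (hypothetical "moved source")

# SRR change (after minus before) when the same filter is applied to each.
matched_change = srr_db(convolve(matched, inverse), 1) - srr_db(matched, 1)
mismatched_change = srr_db(convolve(mismatched, inverse), 1) - srr_db(mismatched, 1)
```

Even in this toy case the same filter raises the SRR for the response it was derived from and lowers it for the other one, which is the qualitative pattern reported in Table 2.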

Our  empirical  results  are  in  accordance  with  the  observation  reported  in  [37]  where  it  is 
stated  that  inverse  filtering  increases  the  distortion  when  a  response  recorded  at  a  different 
position  is  employed  for  dereverberation.  Radlovic  et  al.  [42]  gave  a  theoretical  analysis  on  the 
sensitivity  of  inverse  filtering.  In  their  paper,  a  quantitative  distortion  measure  of  frequency 
responses  is  used  based  on  the  difference  between  the  two  transfer  functions  from  the  source  to 
the  reference  and  to  the  displacement  point,  respectively.  The  measure  calculated  in  simulation 
is  in  good  agreement  with  their  theoretical  derivation  and  shows  that  small  changes  in  the 
source  or  microphone  position  on  the  order  of  one-tenth  of  the  acoustic  wavelength  cause 
significant degradations in the equalized room response. It is also pointed out that greater
distortion is expected for high frequencies, indicating higher sensitivity to position changes in
the high-frequency range.

Figure 5: Effects of inverse filtering on room impulse responses. (a) An RIR generated by the image model.
The source and the microphone are at (4, 0.9, 1) and (2, 1, 1) respectively in Room 3, as listed in Table 1.
(b) The result of convolving the RIR in (a) with the estimated inverse filter. (c) A different RIR in
the same room but with the source location changed to (0.6, 2.4, 1). (d) The result of convolving the impulse
response in (c) with the same estimated inverse filter.

5.3  SNR  Results 

Given  that  the  computational  objective  of  our  segregation  system  is  to  identify  T-F  regions 
that  are  target  dominant,  we  adopt  the  same  SNR  measure  in  [24]  to  assess  the  segregation 
performance  using  the  resynthesized  speech  from  the  ideal  binary  mask  as  the  ground  truth. 
Equation  (4)  gives  this  measure.  Considering  that  harmonic  features  extracted  in  Section  3.1 
likely  vary  with  changing  acoustic  environments,  we  evaluate  the  proposed  system  in  three 
different  scenarios  which  place  different  levels  of  demand  on  generalization: 

Case 1

Reverberation time is known. In this case, our evaluation is conducted within the same
room but assesses the system’s ability to generalize to different source/receiver locations. For
example, we train on one set of 100 reverberant mixtures in Room 3 with T60 = 0.3 s and
test the resulting system on the other two sets of mixtures in the same room in terms of
SNR gain. The dotted line in Fig. 6 represents this case. The performance curve depicts
the SNR gain - the improvement over the initial SNR before segregation - of seven separate
systems, each trained at a different T60. This curve should represent the performance upper
bound of our system in unknown reverberant conditions. The observed performance drop with
increasing reverberation likely reflects the nature of the ascending difficulty of segregation. In
other words, segregation in highly reverberant conditions is probably a harder task than in
low reverberant conditions. Subjective tests reveal that human listeners’ ability to separate
competing voices degrades with increasing levels of reverberation [16].

Table 2: Signal-to-reverberant ratio (SRR) change (in dB) by applying an estimated inverse filter to each
room impulse response function

Room No.      1a      1b      2a      2b      3a      3b
1           -7.8    -6.8    -7.4    -6.0    -7.5    -8.4
2           -3.7    -2.8    -3.8    -4.5    -4.4    -2.7
3            7.2    -2.1    -0.2    -2.9    -2.8    -2.3
4           -1.5    -1.1    -0.3    -2.7    -0.8    -0.5
5           -2.1    -0.4    -2.2    -1.4    -1.4    -1.4
6            0.0     1.2    -0.3     0.2     1.3    -0.1

Case  2 

With unknown T60, train on all different T60’s. Specifically, we form a training corpus with
a total of 700 reverberant mixtures by using the first set of mixtures in each room together
with anechoic mixtures. The pentagram line in Fig. 6 shows the system performance in this
case. This way of training gives a single system regardless of reverberant conditions and the
performance is only about 0.5 dB worse on average compared to the known room case. A
downside is that training now becomes computationally more expensive since the training set
is seven times as large as training in a single room.

Case  3 

With unknown T60, train on a single T60. If we assume reverberation time is more likely above
0.3 s, which is typical of rooms encountered in daily life [30], we can train at T60 = 0.6 s, the
most reverberant condition, because generalization to less reverberation may be better than
the other way around. The rationale here is to obtain the best possible classifier under the
least favorable condition, often referred to as a MINIMAX solution [17,50]. The SNR gain
of this case is the circle line in Fig. 6. Some degradation is observed, but the system yields
relatively good performance at high T60’s.
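The SNR gain used to compare the three cases can be sketched as follows. Equation (4) is not reproduced in this section, so the sketch assumes the measure has the common form of the energy of the reference (speech resynthesized from the ideal binary mask) over the energy of the difference between reference and estimate. The function names and the toy waveforms are hypothetical.

```python
import math

def snr_db(reference, estimate):
    """10*log10 of reference energy over the energy of (reference - estimate)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10(signal / noise)

def snr_gain_db(ibm_resynth, mixture, segregated):
    """Output SNR of the segregated speech minus the initial SNR of the mixture."""
    return snr_db(ibm_resynth, segregated) - snr_db(ibm_resynth, mixture)

# Toy waveforms (illustrative values only).
ibm_resynth = [1.0, -1.0, 0.5, 0.0]   # ground truth from the ideal binary mask
mixture = [1.5, -0.5, 1.0, 0.5]       # unprocessed input
segregated = [1.1, -0.9, 0.5, 0.0]    # system output

gain = snr_gain_db(ibm_resynth, mixture, segregated)
```

Because both terms share the same reference, the gain reduces to the ratio of the two error energies expressed in dB.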

Although  the  proposed  system  is  designed  to  segregate  reverberant  speech,  the  above  results 
suggest  that  our  system  also  works  well  in  the  anechoic  condition  (T60  =  0  s).  Using  multiple 
features  for  estimating  the  grouping  cue,  our  system  shows  a  13.8  dB  SNR  gain  under  the 
anechoic  condition  even  when  it  is  trained  at  T60  =  0.6  s.  As  a  comparison,  the  segregation 
system  by  Hu  and  Wang,  which  is  designed  for  and  tested  on  voiced  speech  mixtures  in 
anechoic  conditions  [24],  produces  an  SNR  gain  of  12.9  dB  on  the  same  corpus.  This  indicates 
that our system performs a little better than the Hu-Wang system in the anechoic condition.
If  training  is  matched  with  the  anechoic  condition,  our  system  achieves  an  SNR  gain  of  15.8 
dB  (see  Fig.  6)  which  is  significantly  higher  than  that  of  their  system. 

We also compare alternative segmentation methods: cross-channel correlation alone, onset/offset
analysis alone, and the proposed method which combines these two methods. All
methods use the same training with one set of 100 reverberant mixtures in Room 6 with
T60 = 0.6 s from Cooke’s corpus. Fig. 7 shows the system performance in terms of SNR gain
in different reverberant conditions. The proposed segmentation using cross-channel correlation
in low frequency and onset/offset analysis in high frequency yields the best performance.
The method using just cross-channel correlation - C(c, m) in low frequency and C_E(c, m) in
high frequency - performs about 0.4 dB worse. This difference indicates the utility of onset/offset
analysis in the high-frequency range. The method using onset/offset analysis across
all frequencies gives the worst performance. This method forms segments by matching onset
and offset fronts, and segment boundaries tend to be block-like, missing detailed segment
shapes. Although it performs a little better than the cross-channel correlation method in the
high-frequency range, it underperforms the latter in the low-frequency range. Note that the
superior performance reported in [26] is based on a region-level measure, not an SNR measure.

5.4 Comparison with Roman-Wang Model

In this subsection, we use Cooke’s corpus to compare the performance of our proposed system
to that of the inverse filtering based approach by Roman and Wang [45]. In their system,
an inverse filter is first estimated by maximizing the kurtosis of the inverse-filtered linear
prediction residual of the reverberant speech from the target location in the absence of
interference [21,57]. Then, the obtained inverse filter is applied to the reverberant mixture
consisting of both the reverberant target and the reverberant interference. In order to make a
fair comparison between the Roman-Wang and the proposed system, we use the same subset
of reverberant mixtures for learning. Specifically, in the Roman-Wang system, we use the
same inverse filter (as used in the above evaluation) that is estimated from the RIR of “1a” in
Room 3. In the proposed system, MLP learning also takes place on the reverberant mixtures
generated by the first set of {r_T, r_I, r_M} in Room 3. Since both systems need pitch estimation,




OSU Dept. of Computer Science and Engineering Technical Report #21, 2008


Figure 6: Voiced speech segregation performance. SNR gain is measured under room conditions with T60
ranging from 0 to 0.6 s. The dotted line, the pentagram line and the circle line represent the three cases discussed
in the text.


Figure  7:  Comparison  of  SNR  gain  between  different  segmentation  methods.  The  square  line,  the  circle  line 
and  the  pentagram  line  represent  three  segmentation  methods. 


we use a priori pitch contours from premixed reverberant targets extracted using Praat [7] in
order to remove the influence of pitch errors on the segregation system (see also [28]). Table 3
summarizes the SNR gain evaluation. Each number in the table represents the average SNR
gain on reverberant mixtures generated by a particular set of {r_T, r_I, r_M} in one of the rooms.
The two numbers in italic correspond to matched training conditions as stated above. Their
difference shows the advantage of the proposed system when training and testing on the same
set of mixtures. The number at the bottom of each column is the average of the three sets,
which provides the average SNR gain in the same room. The SNR gains under the anechoic
condition (not shown in the table) are 10.37 dB and 14.87 dB for the Roman-Wang and the
proposed system, respectively. As can be observed in Table 3, the proposed system achieves
significantly higher SNR gains across all different reverberation times than the Roman-Wang
system. The overall 3.4 dB improvement is mainly brought about by accurate unit labeling in
high frequency: their system cannot reliably handle unresolved harmonics in those regions for
reverberant signals. Through MLP training, the T-F units in both low and high frequencies
can be labeled in our system. Note that this improvement partly results from segmentation
using multiscale onset/offset analysis in high frequency (see Fig. 7).
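As a quick arithmetic check on the overall improvement, the per-room averages from the last row of Table 3 and the two anechoic values quoted above can be compared directly. The text does not state which conditions the 3.4 dB figure averages over; the sketch below simply computes the mean difference with and without the anechoic condition, and the variable names are hypothetical.

```python
# Average SNR gains (dB) per T60, copied from the "Average" row of Table 3,
# plus the anechoic values (T60 = 0) quoted in the text.
roman_wang = {0.0: 10.37, 0.1: 10.56, 0.2: 8.99, 0.3: 9.27,
              0.4: 8.97, 0.5: 8.66, 0.6: 7.24}
proposed = {0.0: 14.87, 0.1: 14.88, 0.2: 12.77, 0.3: 13.01,
            0.4: 11.81, 0.5: 11.61, 0.6: 9.15}

def mean_improvement(conditions):
    """Mean of (proposed - Roman-Wang) over the given T60 conditions."""
    diffs = [proposed[t] - roman_wang[t] for t in conditions]
    return sum(diffs) / len(diffs)

all_conditions = sorted(roman_wang)                          # anechoic + six rooms
reverberant_only = [t for t in all_conditions if t > 0.0]    # six rooms only

gain_all = mean_improvement(all_conditions)
gain_reverberant = mean_improvement(reverberant_only)
```

Averaging over all seven conditions gives roughly 3.4 dB, while the six reverberant rooms alone give roughly 3.3 dB, so the quoted figure is consistent with the anechoic condition being included.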

In [45], Roman and Wang reported a 1.3 dB difference between the systems with and
without using inverse filtering when T60 = 0.35 s. Without inverse filtering as a preprocessing
stage, their system is quite similar to the Hu-Wang system [24], which was discussed earlier.
This implies that our system performs much better than the Hu-Wang system when it is
applied to reverberant mixtures. In other words, although there are reasons to expect that
pitch-based grouping may not be very sensitive to reverberation, such separation algorithms
are unlikely to perform well without dealing with the issue of reverberation.


Table 3: Comparisons of SNR gain (in dB) between the proposed system and the Roman-Wang system

                             Roman-Wang                                      Proposed
T60 (s)     0.1     0.2     0.3     0.4     0.5     0.6     0.1     0.2     0.3     0.4     0.5     0.6
Set 1     10.24    9.58   10.07    8.36    8.92    6.69   14.44   13.52   13.48   11.71   12.59    8.92
Set 2     10.86    9.36    9.74    9.40    7.54    7.88   15.37   13.26   13.37   12.08   10.09    9.59
Set 3     10.59    8.04    7.99    9.15    9.51    7.16   14.84   11.53   12.18   11.65   12.16    8.93
Average   10.56    8.99    9.27    8.97    8.66    7.24   14.88   12.77   13.01   11.81   11.61    9.15

5.5  Evaluation  on  TIMIT 

Here we evaluate how the proposed system generalizes to new speakers and utterances. As
described in Section 5.1, four speakers, two males and two females, are randomly selected
from the TIMIT database. We label the four speakers as: SPK-1(M), SPK-2(M), SPK-3(F)
and SPK-4(F), where “M” stands for male and “F” female. In order to compensate for the
discrepancies between male and female speakers, we train on the first set of mixtures at Room
6 from both genders. The training corpus contains 200 mixtures, one half from SPK-1 and
the other half from SPK-3. The resulting system may be called speaker-independent (SI). For
each speaker, we also train a system on the first set of mixtures at Room 6 from that speaker
only. These systems are called speaker-dependent, or SD_i, where i ∈ {1, 2, 3, 4} represents the
speaker index. The motivation of training a system for each speaker is to evaluate performance
in the matched training scenario, which offers a reference for performance analysis. The
SNR difference between SI and SD indicates degradation due to unmatched training. Since
utterances from the TIMIT database contain unvoiced speech while our system deals with
only voiced speech, we calculate the SNR at voiced speech frames only. Fig. 8 shows the SNR
comparison between SI and SD for each of the four speakers. Note that the SNR performance
in Fig. 8 is lower than that of the previous experiment due to the use of TIMIT sentences
whose spectra significantly overlap with those of interferences. The SNR gain at each T60
condition is averaged over all three sets of mixtures in that condition, the same as the last
row in Table 3. SPK-2 and SPK-4 are the test cases using entirely different speakers, within
which 70% of the sentences are new. The SI curve is not much lower than the SD curve for
these two speakers, demonstrating that our system generalizes well to both unforeseen speakers
and utterances. Note that degradation also exists in SPK-1 and SPK-3, albeit smaller than
that in SPK-2 and SPK-4. This degradation arises because the training corpus contains
two speakers rather than a single matching one. Table 4 gives numeric results of average
SNR degradation across all T60’s for different speakers. The maximum degradation is 0.67 dB
for SPK-4, which is fairly small compared to SNR performance variations for different room
conditions.


Table 4: Average SNR degradation (in dB) in SI systems relative to SD systems

SPK-1    SPK-2    SPK-3    SPK-4
0.19     0.42     0.28     0.67

6  Robustness  Analysis 

The  feature-based  learning  in  our  proposed  system  shows  good  generalization  ability  to  various 
reverberant  conditions  as  shown  in  Section  5.  For  example,  the  system  trained  at  Room  6 
with  T60  =  0.6  s  performs  well  in  other  room  conditions.  Furthermore,  the  changes  of  source 
and microphone locations within a room have little effect on system performance. To understand
the  surprisingly  robust  performance,  this  section  provides  an  analysis  at  the  feature  level. 

Basically, the proposed system learns to distinguish between target dominant (class 1)
and interference dominant (class 0) T-F units. Therefore, we reformulate the segregation
problem into two-class classification. Intuitively, if features are robust, we expect that feature
distributions in different reverberant conditions are close to each other. Hence, the distance
between different feature sets can be a quantitative measure of feature robustness. From
another perspective, the distance between class 0 and class 1 subsets within one feature set
describes the classification complexity (or data separability) [48]. Therefore, the role of a
distance measure is twofold: it models the feature variations in different reverberant conditions
and it compares data separability in those conditions at the same time. An example is the
following: Let D(·,·) be a distance measure. Consider two feature sets {F_1, F_2}, each having
two subsets corresponding to class 0 and class 1. A set of these subsets is constructed as

$$\mathcal{F} = \{F_{1,0}, F_{1,1}, F_{2,0}, F_{2,1}\} \qquad (14)$$

where F_{i,j} indicates the subset of class j in F_i. On one hand, D(F_{1,0}, F_{2,0}) and D(F_{1,1}, F_{2,1})
measure the similarity between F_1 and F_2 in the two classes. When both values are small,
good generalization from one set to the other is expected. On the other hand, D(F_{1,0}, F_{1,1}) and
D(F_{2,0}, F_{2,1}) measure the separability of the feature sets, which relates to their performance
upper bound discussed in Section 5.3.

Figure 8: Comparison of SNR gain between SD and SI cases under room conditions with T60 ranging from 0
to 0.6 s. For each speaker from the TIMIT database, the square line is the SD performance and the pentagram
line the SI performance. The first two speakers are males and the next two are females.

Figure 9: Visualization of the CM distance. Plus marks stand for class 0 subsets and cross marks for class
1 subsets. Each mark is labeled in the form of “x.y” where x is the room index and y is the set index. “0”
labels the anechoic situation.

We use the Constrained Minimum (CM) distance [49] as the distance measure in this study.
The CM distance is not only a metric, but also capable of measuring classification complexity
because it is computed by comparing summary statistics of the data sets. According to [49],
the CM distance is derived using the geometrical interpretation of the distribution and is of
Mahalanobis type as

$$d_{CM}(D_1, D_2 \mid S)^2 = (\theta_1 - \theta_2)^T \, \mathrm{cov}^{-1}[S] \, (\theta_1 - \theta_2) \qquad (15)$$

where D_1 and D_2 are two data sets and their means are θ_1 and θ_2. S can be viewed as the
underlying feature distribution function and its covariance matrix cov[S] can be estimated from
D_1 and D_2 if it is unknown. In the case of N available feature sets in different reverberation
conditions, we have

$$\mathcal{F}_N = \{F_{1,0}, F_{1,1}, F_{2,0}, F_{2,1}, \ldots, F_{N,0}, F_{N,1}\}. \qquad (16)$$

The CM distance between every pair of the subsets in F_N is then calculated. To visualize
the relationship between these subsets, we reconstruct 2-D spatial locations from their CM
distances using the Metric Multidimensional Scaling (MDS) technique [15], which transforms
a distance matrix into a set of coordinates such that the Euclidean distances derived from
these coordinates approximate the original distances as closely as possible.
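A Mahalanobis-type distance of the form (15) can be sketched in plain Python as below. This is not the reference implementation of [49]: when no covariance is supplied, the sketch estimates it by pooling both data sets after removing each set's own mean, which is only one possible reading of "estimated from D_1 and D_2". The function names and the toy feature vectors are hypothetical, and the MDS step is omitted.

```python
import math

def mean(rows):
    """Component-wise mean of a list of feature vectors."""
    dim = len(rows[0])
    return [sum(r[k] for r in rows) / len(rows) for k in range(dim)]

def pooled_covariance(d1, d2):
    """Covariance of both sets after removing each set's own mean (assumption)."""
    m1, m2 = mean(d1), mean(d2)
    centered = [[x - m for x, m in zip(r, m1)] for r in d1]
    centered += [[x - m for x, m in zip(r, m2)] for r in d2]
    dim, n = len(m1), len(centered)
    return [[sum(r[a] * r[b] for r in centered) / (n - 2) for b in range(dim)]
            for a in range(dim)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def cm_distance(d1, d2, cov=None):
    """sqrt((theta1 - theta2)^T cov^{-1} (theta1 - theta2)), as in (15)."""
    if cov is None:
        cov = pooled_covariance(d1, d2)
    diff = [a - b for a, b in zip(mean(d1), mean(d2))]
    weighted = solve(cov, diff)          # cov^{-1} (theta1 - theta2)
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))

# Two toy 2-D feature subsets (illustrative values only).
d1 = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
d2 = [[4.0, 5.0], [6.0, 5.0], [4.0, 7.0], [6.0, 7.0]]
dist = cm_distance(d1, d2)
```

Applying `cm_distance` to every pair of subsets yields the distance matrix that an MDS routine would then embed in two dimensions.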

Fig. 9 presents the 2-D visualization of the CM distance of feature subsets. The 38 plotted
subsets are derived from 19 feature sets, which correspond to the 19 sets of 100 reverberant
mixtures generated from the voiced corpus described in Section 5.1. Different reverberant
conditions are denoted as x.y where x ∈ {1, ..., 6} represents the room index and y ∈ {1, 2, 3}
is the set index. As shown in the figure, the features from each room tend to cluster together,
suggesting a strong similarity between them. This indicates that our features are robust to
source/microphone location changes within a room. Based on the observation that features in
rooms with close T60’s also have relatively short distances, we can conclude that the features
are robust to different reverberant rooms when these rooms have close T60’s. On the right
side (“x”) of Fig. 9, there is a clear pattern of position change with the change of T60. Such
a trend is not as prominent on the left side (“+”) because our features are pitch-based and
background T-F units may not be sensitive to such features. However, it will not affect the
comparison of classification complexity as the changes on the left side are smaller than those
on the right side and can therefore be ignored. Classification complexity can be compared
by measuring the distance between the + and the x mark of the same label, which indicates
that the two subsets come from the same feature set. Fig. 9 suggests that the classification in
low T60’s is easier because of its relatively large distance while the classification in the most
reverberant situation (i.e., T60 = 0.6 s) is the most difficult. This is consistent with the results
in Fig. 6.

7  Discussion 

A key problem in reverberant speech separation is smeared harmonicity, which has a negative
impact on harmonic cues and results in significant performance degradation in previous CASA
systems (e.g., [24]). The approach of inverting reverberant effects is sensitive to specific
configurations, although it achieves good performance in the matched configuration. In this work,
a set of six pitch-based features is extracted and these features incorporate information of
both filter responses and their envelopes. Therefore, unlike [24], unit labeling in the low- and
high-frequency ranges can be handled together. The harmonic index and the deviation from the
nearest harmonic, first proposed in [23], are demonstrated in this study to be effective
supplementary features in modeling harmonicity under reverberant conditions - when those
features are excluded from the feature set, the overall performance drops significantly in our
further experiments. MLP provides a way to combine these features into a unified grouping cue.
Our supervised learning approach differs significantly from other CASA systems; it produces
significantly better performance and generalizes well to different reverberant conditions. It is
worth emphasizing that the proposed system also generalizes well to unseen speakers and utterances.

In  the  Bayesian  framework,  MLP  may  be  viewed  as  an  optimal  classifier  that  discriminates 
target-dominant  units  from  those  belonging  to  the  interference.  One  common  problem  of 
designing  a  classifier  is  the  uncertainty  in  a  priori  class  probability.  Although  the  current  study 
controls  the  SNR  of  all  mixtures  at  0  dB  in  both  training  and  test  phases,  the  above  problem 
is  of  concern  when  SNR  varies.  For  example,  when  testing  on  10  dB  mixtures,  the  system 
tends  to  label  fewer  target  units  than  it  should,  indicating  a  bias  towards  the  interference 
class.  A  common  practice  to  increase  system  robustness  against  uncertain  priors  is  training 
the  classifier  over  a  data  set  with  the  least  biased  priors  [5,31].  To  use  a  training  corpus  of  0 
dB  mixtures  is  consistent  with  the  above  idea.  However,  the  use  of  a  training  set  with  equal 
priors  represents  a  solution  that  is  unbiased  towards  any  priors,  but  it  does  not  theoretically 
imply  robustness  against  other  priors  [1].  Some  research  provides  clues  [35,43,47]  on  how 
to  work  with  different  SNR’s.  An  adaptive  solution  uses  incoming  information  to  reduce  the 
uncertainty  and  improve  the  classifier.  More  specifically,  the  SNR  of  the  incoming  mixture  can 
be  estimated  to  infer  real  priors  and  then  the  classifier  can  be  adapted  according  to  estimated 
priors  (see  [23]). 
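The idea of adapting a trained classifier to estimated priors can be illustrated with the standard prior-shift correction from Bayes' rule. This is a generic sketch and not necessarily the method of [23]: it assumes the MLP output approximates the posterior probability of the target class under the training prior, that the class-conditional feature distributions do not change, and that the new prior has already been inferred (for example from an SNR estimate). All names and numbers are hypothetical.

```python
def adapt_posterior(p_target, train_prior, new_prior):
    """Re-weight P(target | x) learned under train_prior to reflect new_prior.

    Follows from Bayes' rule when the class likelihoods stay fixed:
    each class posterior is scaled by (new prior / training prior)
    and the two results are renormalized.
    """
    target = p_target * new_prior / train_prior
    interference = (1.0 - p_target) * (1.0 - new_prior) / (1.0 - train_prior)
    return target / (target + interference)

# A unit the classifier trained with equal priors labels as interference
# (posterior below 0.5) ...
p_trained = 0.4
# ... may be relabeled as target when target-dominant units become more likely,
# e.g. at a higher mixture SNR. The value 0.7 is an illustrative prior.
p_adapted = adapt_posterior(p_trained, train_prior=0.5, new_prior=0.7)
label = 1 if p_adapted > 0.5 else 0
```

With matching priors the posterior is unchanged, so the correction only matters when the test condition departs from the training condition.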

Determination  of  pitch  is  a  fundamental  problem  in  CASA  and  reliable  pitch  estimation 
is  critical  for  applying  harmonic  grouping.  Although  pitch  may  be  a  relatively  robust  feature 
to  reverberation  [56],  few  pitch  determination  algorithms  are  developed  in  both  noisy  and 
reverberant  conditions.  In  this  study,  we  use  a  priori  pitch,  calculated  from  reverberant  target 
speech  before  mixing,  and  future  study  needs  to  address  the  pitch  determination  problem  in 
room  reverberation. 

In  summary,  we  have  proposed  a  system  capable  of  segregating  reverberant  target  speech. 
Two  novel  ideas  are  employed.  First,  a  supervised  learning  approach  establishes  the  mapping 
from  a  set  of  pitch-based  features  to  a  grouping  cue  and  a  new  objective  function  is  proposed 
to maximize SNR. Second, a multiscale onset/offset analysis is employed to form reliable
segments  in  the  high-frequency  range. 